Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move partialsuccess code to internal package #3146

Merged
merged 2 commits into from
Sep 8, 2022

Conversation

MrAlias
Copy link
Contributor

@MrAlias MrAlias commented Sep 6, 2022

Addresses this unresolved comment originally posted by @MrAlias in #3106 (comment)

@MrAlias MrAlias added the pkg:exporter:otlp Related to the OTLP exporter package label Sep 6, 2022
@MrAlias MrAlias added this to the Release v1.10.0 milestone Sep 6, 2022
@MrAlias MrAlias added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Sep 6, 2022
@codecov
Copy link

codecov bot commented Sep 6, 2022

Codecov Report

Merging #3146 (bcc4302) into main (569f743) will increase coverage by 0.0%.
The diff coverage is 100.0%.

Additional details and impacted files

Impacted file tree graph

@@          Coverage Diff          @@
##            main   #3146   +/-   ##
=====================================
  Coverage   76.4%   76.5%           
=====================================
  Files        180     180           
  Lines      12014   12014           
=====================================
+ Hits        9190    9192    +2     
+ Misses      2583    2581    -2     
  Partials     241     241           
Impacted Files Coverage Δ
exporters/otlp/internal/partialsuccess.go 100.0% <ø> (ø)
exporters/otlp/otlptrace/otlptracegrpc/client.go 92.5% <100.0%> (ø)
exporters/otlp/otlptrace/otlptracehttp/client.go 76.2% <100.0%> (ø)
sdk/trace/batch_span_processor.go 81.9% <0.0%> (+0.8%) ⬆️

@jmacd
Copy link
Contributor

jmacd commented Sep 7, 2022

I'll say that I didn't understand the reason to make the error private in the first place. @MadVikingGod had asked for a way to access the structure of the error, which is lost when it's internal.
As a user, how will I know the difference between a partial-success error and another kind of error? I'll have to parse the error string which is what @MadVikingGod wanted to avoid, IIUC.

@jmacd
Copy link
Contributor

jmacd commented Sep 7, 2022

@MrAlias why not expose the new error? I can't understand what you're trying to avoid making public. There are good reasons for uses to want to count the number of dropped metric points, which is exactly what this new structure gave you.

@MrAlias
Copy link
Contributor Author

MrAlias commented Sep 7, 2022

@MrAlias why not expose the new error? I can't understand what you're trying to avoid making public. There are good reasons for uses to want to count the number of dropped metric points, which is exactly what this new structure gave you.

Can you help explain the good reasons a users would want this for?

I do not understand what a user will do with this information when it is sent to the global error handler. They cannot retry the send, but all I can see is they can log this information. You don't need all these new exported types to do that.

@jmacd
Copy link
Contributor

jmacd commented Sep 7, 2022

When it comes time to do what I'd like done, it can be done from inside the OTLP exporter with access to the internal package, however it would be great to be able to (a) write alternative instrumentation and (b) experiment with such instrumentation before standardizing it.

The metric I'd like produced is to count the number of spans, metric data points, and logs that are accepted by the server successfully. To do that I want to subtract the number that are dropped due to repeated failure and the number of that were rejected by the server as being malformed in some way. The field contained in the partial-success structure is one of the inputs.

I'm fiercely opposed to excessive logging. Handling errors in this way is already bad, in my opinion. By leaving a publicly-accessible struct here, I'm able to tailor my error handler to suppress partial-success errors, perhaps, meaning I could decide to sample them or rate-limit them differently than unrecognized errors, or even simply just count the number of points rejected.

@MadVikingGod
Copy link
Contributor

Could we split the difference, make an interface that the error will implement, but keep the implementation internal.

This allows this information to be represented in some way beyond just a log, but also minimizes the exposed API surface.

@MrAlias
Copy link
Contributor Author

MrAlias commented Sep 7, 2022

Thanks for providing the use-cases, they are insightful.

When it comes time to do what I'd like done, it can be done from inside the OTLP exporter with access to the internal package, however it would be great to be able to (a) write alternative instrumentation and (b) experiment with such instrumentation before standardizing it.

The metric I'd like produced is to count the number of spans, metric data points, and logs that are accepted by the server successfully. To do that I want to subtract the number that are dropped due to repeated failure and the number of that were rejected by the server as being malformed in some way. The field contained in the partial-success structure is one of the inputs.

Adding metrics to the OTLP exporters is a great idea. I definitely agree we should add this into the exporter itself as the project progresses. However, I do not think the global error handler should be used in the prototyping of this data. Due to the global nature it will not be able to decern where these errors came from and how to allocate what error to what exporter. I would expect wrapping the exporter with a span processor that exposed these metrics to be a better approach (similar to this but it would wrap the SpanExporter itself).

In that situation, the error would need to be returned from the ExportSpans method itself. I would support adding an error to the package API if that were the case.

I'm fiercely opposed to excessive logging. Handling errors in this way is already bad, in my opinion. By leaving a publicly-accessible struct here, I'm able to tailor my error handler to suppress partial-success errors, perhaps, meaning I could decide to sample them or rate-limit them differently than unrecognized errors, or even simply just count the number of points rejected.

I can see the desire to not have excessive logging. But I think we have made a misstep then in using the global error handler here instead of the global logger. If we were to log these things with the structured logging interface it provides it would natively allow loggers with rate limiters to handle this. As it is now we need to export these types, send types along the global error handler, parse the types with a registered error handler, and then send to a rate limited logger.

I think based on these use-cases we may need to not only rethink these types, but I would like to consider again how we are "exporting" the partial success response itself. I think we could refactor these types and use them as return values from ExportSpans and log their presence to better achieve what was proposed in #3106. @jmacd thoughts?

@MrAlias
Copy link
Contributor Author

MrAlias commented Sep 7, 2022

I think we could refactor these types and use them as return values from ExportSpans ...

Also, I think if we do this it should be done at the SDK trace level. That way any exporter would be able to report these errors.

@MadVikingGod
Copy link
Contributor

Also, I think if we do this it should be done at the SDK trace level. That way any exporter would be able to report these errors.

This probably shouldn't be part of the SDK, because this is an applicaiton level error for OTLP. Not every exporter will have this kind of error.

I support this PR because it allows us time to make the "right" decision after this is released. Currently OTLP only returns protocol level errors because there wasn't any application errors. It would be a change of behavior to start returning the new application errors, so I think we need to understand how this might break current usage, if at all.

At some practical level we will have to expose some API to make use of these errors, whether they are exposed via an ErrorHandler or directly from the ExportSpans. I would personally prefer to use the errors.Is(), but an interface with FailedSpanCount() or something similar would be just as effective.

After we expose the error in some way I can see us having a migration path of first expose via otel.Handle() inside the exporter. Next, if we accept the change in behavior, expose the error via returning it and allow the SpanProcessor to call otel.Handle(). And finally, create some tool to wrap the exporter that can measure this with the advantage of know how many spans were sent.

@jmacd
Copy link
Contributor

jmacd commented Sep 7, 2022

global error handler here instead of the global logger

Not sure I agree. If there's a dedicated logger for logging errors produced inside the SDK, then maybe. Without more semantic conventions on errors produced by OTel SDKs, I think these have to be handled specially.

the structured logging interface it provides it would natively allow loggers with rate limiters to handle this

I was expecting code in a handler like:

if errors.Is(err, otlp.PartialSuccessError{}) {
   ... extract count for something useful do not log this
}

Admittedly, use of a global handler is not ideal here. I would want a per-exporter handler. You used the return value from ExportSpans in your example, but think about how this will work for the metrics SDK? I would be glad for a per-exporter ErrorHandler.

FWIW I argued against an unstructured error message being returned in the first place. The actual partial success errors being produced in the Lightstep metrics ingest path are structured to begin with, will say which metrics are failing and for which reasons. However, since the information is so dreadfully repetitive, it responds with one one example at most per response. Then, because OTLP doesn't support that structure, it ends up as a single example of formatted error message. 🤷 I just want to count the number of successful/failed metric points.

@MrAlias
Copy link
Contributor Author

MrAlias commented Sep 7, 2022

Also, I think if we do this it should be done at the SDK trace level. That way any exporter would be able to report these errors.

This probably shouldn't be part of the SDK, because this is an applicaiton level error for OTLP. Not every exporter will have this kind of error.

I don't think that is correct way to think about the problem. If more than one exporter reports partial success, which seems likely, they should both report the same error type. Otherwise code interpreting the OTLP error will now need to be update for every exporter that reports this type of error. This would follow suit in the same way we have Reader errors at the SDK that every `Reader implementation can return.

I support this PR because it allows us time to make the "right" decision after this is released. Currently OTLP only returns protocol level errors because there wasn't any application errors. It would be a change of behavior to start returning the new application errors, so I think we need to understand how this might break current usage, if at all.

I don't follow this. An error returned from ExportSpans is an error from that function call. There is not distinction about the error category, that is the benefit of Go defining errors as interfaces. How the error is handled can depend on the error, but the behavior of returning an error when an error occurred in the function call would remain the same.

After we expose the error in some way I can see us having a migration path of first expose via otel.Handle() inside the exporter. Next, if we accept the change in behavior, expose the error via returning it and allow the SpanProcessor to call otel.Handle(). And finally, create some tool to wrap the exporter that can measure this with the advantage of know how many spans were sent.

Reporting the error with the global error handler now means we will double report the error later or stop reporting the error with the handler in the future. Either are not ideal.

@MrAlias
Copy link
Contributor Author

MrAlias commented Sep 7, 2022

[...] If there's a dedicated logger for logging errors produced inside the SDK, then maybe.

There is, namely this.

the structured logging interface it provides it would natively allow loggers with rate limiters to handle this

I was expecting code in a handler like:

if errors.Is(err, otlp.PartialSuccessError{}) {
   ... extract count for something useful do not log this
}

Right, this is what I was expecting. But if you used the error logger to log this event that code would not be needed, nor the new PartialSuccessError.

Admittedly, use of a global handler is not ideal here. I would want a per-exporter handler.

I must be missing something. The caller of the ExportSpan function is the per-exporter handler. That function receives an error as a return value from calling the function and determines how it should be handled.

You used the return value from ExportSpans in your example, but think about how this will work for the metrics SDK? I would be glad for a per-exporter ErrorHandler.

Why would this not work in the metric SDK? An error is similarly returned from Export?

Export(context.Context, metricdata.ResourceMetrics) error

I just want to count the number of successful/failed metric points.

I think this is key! I want that as well (and so did @dashpole). I see returning this information as an error from the call to ExportSpans as the best way to do this. Even when we add metrics about the partial success to the exporter, what if a user wanted to count these number and log them locally or report them to the otel ErrorHandler. They would still be able to do that with a SpanProcessor wrapping the SpanExporter. It provides the most functionality and follows standard Go error handling practices.

@jmacd
Copy link
Contributor

jmacd commented Sep 7, 2022

I think you mean to replace the PartialSuccessError with either something like:

  otel.Info("partial success", attribute.Int("number_rejected", ...), attribute.String("error_message", ...), attribute.String("signal", "metrics")

or maybe

  otel.Error(PartialSuccess{...}, "exporter partial success", attribute.String("signal", "metrics"))

Those are OK with me.

I don't think we should be RETURNING these as errors because the export itself is not failing, so I don't expect to have to provide a new span-exporter, metrics-exporter, etc., just in order to get these messages to show on the console.

Later, someone in OTel will organize a way to report on each signal -- at that point, where a non-nil error means a total failure, we will I assume have to inject some kind of error-handler to count the total loss of a batch of spans/metrics/logs, etc.

@MrAlias
Copy link
Contributor Author

MrAlias commented Sep 7, 2022

I think you mean to replace the PartialSuccessError with either something like:

  otel.Info("partial success", attribute.Int("number_rejected", ...), attribute.String("error_message", ...), attribute.String("signal", "metrics")

or maybe

  otel.Error(PartialSuccess{...}, "exporter partial success", attribute.String("signal", "metrics"))

Those are OK with me.

Right. This was the approach I would defer to if we wanted to support logging (rate limited, or otherwise) of the partial success.

I don't think we should be RETURNING these as errors because the export itself is not failing

I disagree. The export network call may not have failed, but the call to ExportSpans did fail. The payload the caller passed did not successfully export.

Errors in Go are used to communicate when aberrant and unexpected events happen. Similar to an io.Writer, io.Reader, "html/template".ExecuteTemplate, or "text/template".ExecuteTemplate, returning an error explaining why part of the passed data was not handled is a common Go practice.

so I don't expect to have to provide a new span-exporter, metrics-exporter, etc., just in order to get these messages to show on the console.

I don't follow this. Our current behavior of the simple span processor and the batch span processor send ExportSpans errors to the global error handler. Why would you need to provide a new span-exporter?

@MrAlias
Copy link
Contributor Author

MrAlias commented Sep 8, 2022

I created a proof-of-concept for how returning the error from a call to ExportSpan will allow span processor creation that where metric export experiments can happen and the default span processors will still register the error with the default handler.

@MrAlias MrAlias merged commit 13906ac into open-telemetry:main Sep 8, 2022
@MrAlias MrAlias deleted the mv-otlp-part-success branch September 8, 2022 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg:exporter:otlp Related to the OTLP exporter package Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants